A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model

Background Orphan gene play an important role in the environmental stresses of many species and their identification is a critical step to understand biological functions. Moso bamboo has high ecological, economic and cultural value. Studies have shown that the growth of moso bamboo is influenced by various stresses. Several traditional methods are time-consuming and inefficient. Hence, the development of efficient and high-accuracy computational methods for predicting orphan genes is of great significance. Results In this paper, we propose a novel deep learning model (CNN + Transformer) for identifying orphan genes in moso bamboo. It uses a convolutional neural network in combination with a transformer neural network to capture k-mer amino acids and features between k-mer amino acids in protein sequences. The experimental results show that the average balance accuracy value of CNN + Transformer on moso bamboo dataset can reach 0.875, and the average Matthews Correlation Coefficient (MCC) value can reach 0.471. For the same testing set, the Balance Accuracy (BA), Geometric Mean (GM), Bookmaker Informedness (BM), and MCC values of the recurrent neural network, long short-term memory, gated recurrent unit, and transformer models are all lower than those of CNN + Transformer, which indicated that the model has the extensive ability for OG identification in moso bamboo. Conclusions CNN + Transformer model is feasible and obtains the credible predictive results. It may also provide valuable references for other related research. As our knowledge, this is the first model to adopt the deep learning techniques for identifying orphan genes in plants. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04702-1.

genome [3]. Currently, a growing number of OG are being identified in plants, including taxa such as Arabidopsis, Populus, Oryza sativa and sweet orange [4][5][6][7]. Many annotated OG are often differentially expressed in response to stresses and are considered to be determinant of species characteristics [8][9][10]. Hence, the identification of OG may provide a better understand for OG adaptation.
Moso bamboo (Phyllostachys edulis) belongs to the subfamily Bambusoideae of the Poaceae family; it shows characteristics of fast growth and excellent material production and can therefore be used to produce cloths, artwork, paper and food [11]. Recent studies have revealed that stresses such as drought and high temperature can affect the growth of moso bamboo as well as the yield and quality of moso bamboo shoots [12,13]. Although the identification of OG has been widely carried out in many plant species, a comprehensive understanding of OG is lacking in moso bamboo. Therefore, the discovery of OG is of great significance for subsequent research in this species.
Currently, orphan genes are generally obtained through BLAST sequence alignment, which compares the sequencing sequence (genome sequence, transcriptome sequence, etc.) of the studied species with other species, BLAST is a relatively reliable tool for identifying orphan genes [2]. BLAST, including BLASTP and tBLASTn, are often used as the alignment tools [14][15][16]. However, the use of these methods to identify OG requires considerable server and time resources and is greatly affected by the computational approach applied [17]. Orphan genes are widely distributed in plant species and generally exhibit significant differences in gene length, the number of exons, GC content, and expression level compared to protein-coding genes [3,6,10,16,18]. Therefore, orphan genes and non-orphan genes are distinguishable in terms of protein features. In traditional machine learning methods, features are often selected and extracted manually, which requires researchers to have prior domain knowledge and keen insight into the relationships between gene essentiality and types of biological data to obtain informative features to train the models. For example, Ying et al. [19] employed a machine learning-based approach to predict autism spectrum disorder (ASD) risk genes using human brain spatiotemporal gene expression signatures, gene-level constraint metrics, and other gene variation features. Recent study shows that using deep learning to extract features often achieves better results compared to its closest machine learning competitors, the majority of these deep learning algorithms rely on features extracted from raw sequences [20]. Hence, it would be significant to develop an efficient method that uses only protein sequences to train such models and produces credible predictive results.
As deep learning has gained popularity, it has been applied successfully in many bioinformatics fields, such as gene prediction [21] and medical image segmentation [22]. In particular, deep learning technology is used to automatically extract and learn abstract information from data to train a model; this approach shows superior performance and high adaptability and avoids complex feature engineering in natural language processing [23]. Moreover, protein sequences are very similar to natural language; the amino acids in proteins are similar to the words in natural language, and the same contextual relationship exists between amino acids in a protein sequence and as between words in a sentence. In this context, the prediction of OG can be considered a natural language processing problem.
Recently, transformer models have been widely used to address sequence problems. Zheng et al. [24] proposed the Segmentation Transformer (SETR) model, which regards semantic segmentation as a sequence-to-sequence prediction task. Zou et al. [25] proposed a transformer model for end-to-end target detection.
In this paper, OG are predicted by using hybrid deep learning based on convolutional neural networks and transformer neural networks. The raw protein sequences are first encoded as vectors or matrices by encoder or word2vec encoding, respectively. Then, protein features are extracted by CNN and transformer model. Finally, the extracted features are input into the fully connected neural network to generate the final recognition result. CNN + Transformer only uses protein sequences to train the models to predict moso bamboo OG. It uses two multicore convolution layers to capture high-frequency k-mer features in protein sequences. The extracted k-mer features are provided to the transformer layer, which captures the long-term interaction information between k-mer features through a multi-head self-attention mechanism.

Methods
Dataset BLAST (2.11.0+) [26] was used to identify OG based on previous studies [6,[27][28][29]. Moso bamboo protein sequences were downloaded from Bamboo GDB [30]. First, we used BLASTp to search for homologs of all 31,987 proteins annotated in moso bamboo in each of the other 136 plant species released in Phytozome v12.1 [31] with an e-value cutoff of 1e −5 . A total of 30,443 moso bamboo genes showed significant similarity to at least one sequence, which were defined as Evolutionarily Conserved genes (ECs) [32] and removed from further analysis (Fig. 1). Second, the remaining 1936 moso bamboo proteins for which no homologs could be found in any of the genomes were used for the next step of searches, which was performed by tBLASTn analysis. In this step, 392 moso bamboo genes were classified as ECs. The final sets of ECs and OG contained 30,443 and 1,544 genes (Additional file 1: Table S1), respectively ( Fig. 1).
To adequately train the deep learning model, the 1544 obtained OG were identified with label 1, and 30,443 ECs were identified with label 0. These genes were combined to form the moso bamboo orphan gene dataset.

Protein embedding
Protein sequences consist of possible 20 amino acids, each of which is represented by a capital letter. To make the protein sequence recognizable by a computer, the first step is to encode each amino acid in the protein according to Table 1, mapping each amino acid to a specific real number, where values of 1-20 represent the amino acid types, and unspecified or unknown amino acids are denoted as 21 [33,34]. The amino acid coding sequence in the table does not affect the experimental results. The sequence profiles thus obtained for each sequence search were processed by truncating the profiles of long sequences to a fixed length (L) and zero padding short sequences, a method that is widely used for data preprocessing and effective training [35]. As a result, we obtained a one-dimensional vector for each protein.
(1) s = (s 1 , s 2 , . . . , s L )s i ∈ {0, 1, . . . , 21} For protein sequences, by learning the dense continuous feature representation of each amino acid in the sequence, a distributional representation can be learned for the amino acids. When these embedding vectors are projected in 2D, it can be shown that amino acids with similarities in hydrophobicity, polarity and net charge, which are factors important for covalent chemical bonding, form visually distinguishable groups [36]. This validates the used of distributed representation as an effective method for encoding amino acids that also helps to preserve important physiochemical properties.
Hence, the sparse feature vectors (S) of a given protein sequence are transformed to dense continuous feature representations using word embedding transformation as  follows:F 1 e ∈ R L×e , where e corresponds to the embedding dimension. The self-attention mechanism in the transformer cannot distinguish between words at different positions, so the input sequence needs to be position-encoded to incorporate the positional information into the input sequence. The input sequence is then encoded with a two-dimensional matrix, F 2 e ∈ R L×e . F 1 e and F 2 e are added together as the input of CNN + Transformer, as follows: Here, S i,j corresponds to the j th word embedding number of the i th amino acid of the protein sequence after preprocessing.

Transformer model
The canonicalization model used in this work was based on a transformer architecture consisting of two separate stacks of layers for the encoder and decoder, respectively [37]. The structure of the encoder and decoder, as shown in Fig. 2, mainly includes a multihead attention mechanism layer, a feed-forward fully connected layer and normalization and residual connections [37,38].
The multi-head attention mechanism incorporates some portion of knowledge written in its internal memory (V) with indexed access by keys (K). When new data The structure of the encoder and decoder in the transformer model arrive (Q), the layer calculates attention and modifies the input accordingly, thus generating the output of the self-attention layer and weighting the parts that carry the essential information. The formulas of Q, K, and V are as follows: F c is the input matrix of the transformer model, W Q i is the query transformation matrix weight vector, W K i is the keyword transformation matrix weight vector, and W V i is the value transformation matrix weight vector.
Q and V perform dot product operations, and the result is divided by the scaling factor √ d The result is divided by the scaling factor and then multiplied by V following the application of the softmax function to obtain the result after self-attention. The output (H) of multi-head self-attention is obtained by splicing the self-attention N-head times, and the formula for multi-head self-attention is as follows.
where head i denotes the i-th self-attention mechanism (1 < = i < = N-head). LayerNorm [39] indicates layer normalization, which mainly serves to speed up the convergence of the model, while the residual network structure is used to reduce the learning load of the model with the following equation.
Dropout [40] is a stochastic deactivation strategy to prevent overfitting in models with a large number of parameters. The feed-forward layer enhances the nonlinear capability of the model with two layers of neural networks, and the transformer structure continues after the feed-forward layer with a normalization and residual layer, according to the following equation.
where O is the output vector of the transformer layer in the model, H′ is the normalized vector, and W (1) , W (2) , b (1) , and b (2) are the weight coefficients and bias of the 2-layer neural network, respectively.

CNN + Transformer model
An important disadvantage of the transformer model is its inefficiency in processing long sequences, mainly due to the computation and memory complexity of the selfattention module [41]. CNN can extract local features in the sequence to shorten the length of the sequence [42,43]. Therefore, this research proposes the CNN + Transformer model structure, which combines a CNN and a transformer model. As shown in Fig. 3, the proposed CNN + Transformer model structure is composed of two multicore convolution layers: a transformer layer and a fully connected layer.
After protein embedding, the protein sequences are encoded as dense, continuous vectors ( F e ) as the input of the CNN layer. The CNN layer in the model consists of two convolutional layers, with the first convolutional layer containing six convolution kernels, as shown in Fig. 3. The variable size of the filter in the convolution is designed to capture k-mer amino acid fragments, where k ranges from 2 (2 peptides) to 7 (7 peptides). The second layer consists of three convolution kernels, denoted as k n j n=1,2,3j=3,6,9 , where n represents the first few kernels, and j represents the size of the corresponding kernel. The kernel size is equal to the size of a convolutional window across j characters, and the parameters are tuned according to the training and validation step. The intermediate feature map of the i-th CNN layer is extracted as After obtaining the intermediate convolutional feature map ( F i m ), downsampling is performed using AvgPooling by taking the average of the output subregions of the CNN layer, which helps maintain the integrity of the information and facilitates subsequent The discrete raw sequence is transformed into a dense, continuous vector F e through feature embedding and then fed into the CNN layer with multi-scale convolution kernels to capture local amino acid k-mers features. The extracted characteristic map of the CNN layer is passed to Transformer neural network. According the multi-head self attention mechanism to capture the long-range interaction characteristics between k-mers. Finally, the Transformer outputs are passed to the fully connected layers to produce identification result global feature extraction. After average pooling, the output from all kernels is concatenated for another average pooling operation, which is used to generate the next layer of the feature maps ( F c ), with the following equation: where AvgPooling and Concat are average pooling operations and connection operations, respectively.
The transformer layer in the CNN + Transformer model is composed of three transformer encoder-decoder layers. In the transformer layer, the feature representation of long-range interaction information between amino acids is obtained by introducing a multi-head self-attention mechanism, and the values of two hyperparameters in the transformer layer, the number of self-attention heads (N-head) and the number of transformer layers (Num-layer), are discussed and explained in the "Results".
The output of the transformer layer is flattened to one dimension. The number of output hidden vectors in the fully connected layer is 2, which indicates the binary classification predicted by the model, and the output vector of the binary classification is transformed into the probability (p) by the ReLU activation function.
where σ is the activation function, l >= 1 is the number of layers of the multiconnected neural network, and W l and b l are the connection weights and biases of the hidden nodes from layer l − 1 to layer l in the fully connected layer, respectively.W l ∈ R n l−1 ×n l , n l−1 and n l are the numbers of hidden nodes in layers l − 1 and l , respectively. T (l) is the output hidden vector of layer l . W s and b s are the weights and biases, respectively, of the penultimate layer of the fully connected neural network, and W s ∈ R n l ×2 .

Implementation details
The loss function is cross-entropy loss function and using the Adam optimizer [44]. In the training set, there were approximately 20 times more moso bamboo OG than moso bamboo non-OG, which led to an imbalance problem during training. We explored a cost-sensitive technique for addressing the imbalance problem when training the classifier. The experiments were conducted by adding weights to the different categories of data during training according to the number ratio and using the category weights to train the model. CNN + Transformer involved multiple hyperparameters. These hyperparameters were tuned on the validation set using agrid search procedure. Their optimal values are mentioned below: 1. Embedding dimension: we tested for e ∈ {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} and found that the optimal model performance was obtained at e = 50. 2. Convolution filters: at the first convolution layer, we chose six convolution filters, s.t. f 1 k ∈ {2, 3, 4, 5, 6, 7} . This allowed us to capture amino acid k-mer frequencies for k-mers of lengths, k = 2 to k = 7. These k-mers represent the local contextual 'biological' words. For the second convolution layer, the optimal filter sizes were f 2 k ∈ {3, 6, 9} . This led to inference of interactions between amino acid k-mers i.e. detect frequencies of local contextual biological phrases consisting of two k-mers having same or different k. For example, the second convolution layer could apprehend interactions between two different dipeptides as well asestimate frequency of a biological phrase comprising a dipeptide and a tripeptide. All deep learning models were implemented in Pytorch (1.7.1) [45]. To speed up the training process, a GPU version of PyTorch on an NVIDIA Tesla P100 PCIe 16 GB system was used for the experiments.

Evaluation strategies
The identification of OG is an imbalance problem, and the number of non-OG in the moso bamboo orphan gene dataset is approximately 20 times greater than the number of OG. Although accuracy and F1 scores are very popular classification evaluation metrics, they can produce misleading results for unbalanced datasets because they do not take into account the ratio between positive and negative samples, and classifiers can achieve good results in terms of specificity but show a large number of false positives. An effective solution for overcoming the class imbalance issue comes from the MCC and BA [46]. When dealing with an unbalanced dataset, GM and BM are better performance metrics if the classification success rate is of concern [47]. Therefore, BA, BM, GM, and MCC were selected as the evaluation metrics in the experiment. The evaluation indicators used in the article are summarized as follows: Here, TP is the number of OG identified as OG, FN is the number of OG identified as non-OG, FP is the number of non-OG identified as OG, and TN is the number of non-OG identified as non-OG.

Dataset division
We use 70% of the data for training, 15% of the data for validation, and the remaining 15% of the data as holdout data for testing. We maintain the same ratio between the number of OG and non-OG during the training, validation, and testing of data. Each experiment is executed 10 times to obtain the average performance (i.e., tenfold cross-validation with 15% independent data as testing data for each run). The average performance for the independent holdout testing datasets is reported. The datasets that we use in the experiment are shown in Table 2.
Orphan genes are widely distributed in plant species and generally exhibit significant differences in gene length [2,10]. As can be seen from Fig. 4, the sequence length distribution of the Original Set, Training Set, Validation Set and Testing Set is similar, indicating that four data sets are comparable.

Performance comparison of different CNN + Transformer architectures
The CNN + Transformer model structure proposed in this study includes an embedding layer and two multicore convolutional layers: a transformer layer and a fully connected layer. Table 3 shows the performance comparison of CNN + Transformer under different model structures. From the table, we can see that the average BA value and GM value of the proposed CNN + Transformer architecture for the testing set can reach 0.877 and 0.881, respectively. Reducing the CNN layer or transformer layer in this structure will result in a decrease in model recognition performance. Specifically, when the transformer layer is removed from the original structure, the average total BA value and GM value are 0.773 and 0.784, respectively. When two multicore convolutional layers are removed from the original structure, the average total BA value and GM value are 0.844 and 0.832, respectively. In contrast, removing one 3-core convolutional layer or two multicore convolutional layers from the original structure will result in a slight decrease in the recognition performance of the model. After adding a fully connected layer of 256 neurons on the basis of the original framework, the recognition performance of the model is basically the same as that of the original framework, but the complexity of the model structure increases. When a 3-core convolutional layer is added on the basis of the original model structure, the average BA value and GM value are 0.853 and 0.837, respectively, and the recognition performance of the model declines.

The effect of hyperparameters on model performance
We evaluated the robustness of CNN + Transformer and elucidated the effect of two hyperparameters on the model: Max_len and Embedding_dim. The MCC values and balance accuracy were used to evaluate the model as the hyperparameters were adjusted. The first hyperparameter is the maximum sequence length, Max_len. From Fig. 5, it can be seen that when the value of Max_len is less than 1200, the values of MCC and BA are positively correlated with Max_len as a whole. At 1200, the average values of MCC and BA in the model are as high as 0.469 and 0.876, respectively, because some sequences with longer inputs are truncated at shorter lengths, which results in the loss of information contained in the sequences. However, when the sequence length continues to increase, the MCC value and BA value show a slight downward trend, indicating that simply increasing Max_len does not improve the recognition performance of the The second hyperparameter is the word embedding dimension, Embedding_dim. We take every 10 values from 0 to 100 as the word embedding dimension. As shown in Fig. 6, when Embedding_dim is equal to 50, the best mean MCC values and BA values of the model are highest; the best mean MCC values are 0.481 and 0.467, respectively; and the best mean BA values are 0.892 and 0.876, respectively. When Embedding_dim is equal to 50, increasing Embedding_dim further does not improve the performance of the model but increases the time to train the weights of the embedding matrix. Therefore, an Embedding_dim of 50 is selected for the experiment values.
Next, we study the effect of two other hyperparameters, N-head and Num-layer, in the transformer layer on model performance. Because the head number (N-head) of the hyperparameter multi-head self-attention mechanism must be divisible by Embedding_ dim, N-head values of 5 and 10 are chosen for the experiment. The numbers of layers of the encoder-decoder in the transformer hyperparameter are 2, 3, and 4. There are six combinations of N-head and Num-layer. Figure 7 shows the performance comparison of the models in the six different combinations. From the figure, we can see that in the CNN + Transformer model, when there are three layers of the transformer encoderdecoder and 10 heads of the attention mechanism, the best, average, and worst BA and MCC values of the model in the testing set are highest for the six combinations; the best,  Adding an encoder-decoder layer on this basis will cause the model performance to drop dramatically, with average BA and MCC values of 0.524 and 0.106, respectively. When one layer of the encoder-decoder is reduced or the number of multi-head attention mechanisms is reduced, the recognition performance of the model will also decrease.

Performance comparison with traditional deep learning models and traditional machine learning models
To verify the recognition performance of CNN + Transformer, we compared it with four basic models (RNN, LSTM, GRU and transformer) that are widely used in deep learning to process sequences and two traditional machine learning models (Support. Vector Machines (SVM) [48], Random Forest [49]). At the same time, we added the CNN layer in CNN + Transformer to fine-tuned RNN, LSTM and GRU models, and the results are shown in Table 4. Each deep learning model was weighted according to the ratio of OG to non-OG. All models were tested ten times, and the average of the ten test results was used for comparison. As shown in the table, CNN + Transformer  In terms of training time and testing time, the Transformer model has slightly higher training and testing time than the recurrent neural network model, which is caused by the computational complexity of multi-head self-attention mechanism in Transformer. Although the complexity of CNN + Transformer model structure is higher than that of Transformer model, the training and testing time of CNN + Transformer model is lower than that of Transformer model. Because the one-dimensional convolution layer of CNN + Transforme model performs preliminary feature extraction on the input feature matrix, the size of the feature matrix is compressed, thus reducing the computational complexity of the model. These results further prove that CNN + Transformer is an effective deep learning model for moso bamboo OG recognition.

CNN + Transformer model verification
The genome of moso bamboo has been updated to the second edition, which includes 50,936 protein sequences [50]. The model is tested on the dataset of second edition. Firstly, we obtained 1275 orphan genes (Additional file 2: Table S2) from the second edition of moso bamboo protein sequences through BLAST sequence alignment. Then, we input all the protein sequences of moso bamboo into the CNN + Transformer model, and the model identified 1466 orphan genes of moso bamboo.
In order to verify the reliability of CNN + Transformer model in identifying orphan genes of moso bamboo, we compared the 1466 orphan genes with 1275 orphan genes which were obtained by BLAST method. The results showed that 1106 protein sequences (Additional file 3: Table S3) were identical and our method had a high coincidence with BLAST results. To further validate the performance of CNN + Transformer model, we trained an optimal CNN + Transformer model using 70% data for training, 15% data for verification and 15% data for testing in the second version of moso bamboo orphan genes dataset. The test set contained 194 orphan genes identified by BLAST tools. The CNN + Transformer model identified 211 protein sequences as orphan genes in the test set, 183 of which were coincident with BLAST results. The above results indicated the reliability of CNN + Transformer in identifying orphan genes of moso bamboo.

OGs functional analyses
Functional annotation, classification and enrichment (GO, KEGG) analysis were performed by the BGI in-house customized data mining system called Dr.Tom (http:// report. bgi. com). The 1254 OGs of moso bamboo were searched against the GO database in order to categorize standardized gene functions. Some OGs were classified into "cellular process", "metabolic process", "catalytic activity", "binding", and "cell" (Fig. 8 (A)). In Fig. 8 (B), we performed GO enrichment analysis on OGs, functions such as "cell wall mannoprotein biosynthetic process", "box H/ACA snoRNA 3'-end processing", "mannose-6-phosphate isomerase activity" and "phosphoribosylformylglycinamidine cyclo-lingase activity" were enriched. According to KEGG pathway annotation, the KEGG pathway classification graph ( Fig. 8 (C)) and enrichment graph ( Fig. 8 (D)) are generated [51][52][53], phyper function in the R project was used to calculate P values and false discovery rates (FDRs). The 1254 OGs were divided into several categories ( Fig. 8 (C)). Among them, "Translation", "Folding, sorting and degradation", classification of OGs. The vertical axis represents the GO terms, and the horizontal axis represents the number of OGs. B Bubble graph for GO enrichment (the bigger bubble means the more genes enriched, and the increasing depth of blue means the differences were more obvious; q-value: the adjusted p-value). C KEGG (Kyoto Encyclopedia of Genes and Genomes) classification of OGs. D Bubble graph for KEGG enrichment "Carbohydrate metabolism" and "Environmental adaptation" were the most prominent. In Fig. 8 (D), we performed KEGG enrichment analysis on OGs, pathways such as "Circadian rhythm-plant", "Protein processing in endoplasmic reticulum", and "Plant-pathogen interaction" were enriched.

Conclusion
Using the sequence alignment method to identify OG in species is time-consuming and laborious, so it is a great challenge to design a robust and efficient model for identifying OG in species. In this study, we propose the sequence-based deep learning model CNN + Transformer with the aim of exploring whether deep learning shows better performance in the identification of moso bamboo OG (an unbalanced classification problem). The model uses a CNN to capture local k-mer amino acid features in the protein sequence and a transformer model to capture remote features between k-mer amino acids. CNNs are often used to capture local features, but they show some defects in effectively identifying the interdependence among long-distance input data. In contrast, in the model based on the transformer neural network, the long-term dependency relationships between local features are captured by introducing a multi-head self-attention mechanism.
The performance of CNN + Transformer was evaluated with a moso bamboo orphan gene dataset, and it achieved very good performance according to four comprehensive evaluation indexes: BA, GM, BM and MCC. Compared with four other models (RNN, LSTM, GRU, and transformer) that are widely used to address sequence problems in deep learning, the performance of CNN + Transformer was significantly superior, which further proved that CNN + Transformer is an effective gene recognition model for moso bamboo OG. At the same time, we combined the CNN layer of CNN + Transformer with RNN, LSTM, GRU and other models and made fine adjustments. The results showed that the recognition performance of the RNN, LSTM and GRU models declined to varying degrees after adding the CNN layer. The efficiency of the transformer model in capturing the correlation dependence between k-mer amino acids in the protein sequence was verified. Subsequently, we compared the results of CNN + Transformer and BLAST on moso bamboo orphan gene dataset of the second edition, and verified that CNN + Transformer is a reliable orphan gene identification model of moso bamboo. CNN + Transformer model was used to predict orphan genes directly from protein sequences, which was essentially different from BLAST method. Therefore, when researchers want to know whether some genes are orphan genes, CNN + Transformer can assist researchers to further confirm orphan genes as an effective tool. In the future, we will explore and integrate orphan gene data of other species to further improve the performance of CNN + Transformer. At the same time, we are interested in how to use deep learning to automatically learn features from biological data rather than manually extracting features heavily based on domain knowledge.